This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Add per-metric error handling and fix other issues #230

Merged: 12 commits merged into HTTPArchive:master on Aug 30, 2021

Conversation

@GJFR (Member) commented on Aug 27, 2021

Unfortunately, the well-known.js custom metric did not do so well in the previous crawls due to errors. If one parseResponse call fails, the whole metric fails. To solve this, I have separated the error handling for each call.

The metric would also fail when the site's policy wouldn't allow loading another page (e.g. due to CORS). This is now also handled per parseResponse call. Since it is not possible to resolve the underlying cause of these errors, I just indicate that an error occurred while fetching the URL.
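A minimal sketch of what this per-call isolation could look like (parseResponseSafe and the error placeholder are hypothetical names, not the exact code in this PR):

// Hypothetical wrapper: one failing fetch or parse no longer fails the whole metric.
async function parseResponseSafe(url, parser) {
  try {
    const response = await fetchWithTimeout(url);
    return await parser(response);
  } catch (e) {
    // The underlying cause (CORS, network error, timeout) is not observable
    // here, so just record that fetching this URL failed.
    return { error: `Error occurred while fetching ${url}` };
  }
}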

Lastly, I also fixed a capitalisation issue in the string comparisons for /robots.txt.
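Presumably this means normalising case before comparing, since robots.txt directives are case-insensitive (a hypothetical illustration, not the actual diff):

// Normalise case so 'Disallow:' and 'disallow:' compare equal.
const isDisallow = line.trim().toLowerCase().startsWith('disallow:');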

I also see a lot of 'The user aborted a request.' errors in our crawl data. Does this have to do with fetchWithTimeout?

function fetchWithTimeout(url) {
  var controller = new AbortController();
  // Abort the request after 5 seconds; note the timer is never cleared, so it
  // fires even for requests that have already settled.
  setTimeout(() => {controller.abort()}, 5000);
  return fetch(url, {signal: controller.signal});
}
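For context, 'The user aborted a request.' is the DOMException message thrown when an AbortController signal cancels a fetch, so requests exceeding the 5-second budget would surface exactly this way. A variant that clears the timer once the fetch settles (an illustration, not part of this PR) could look like:

// Sketch: same timeout behaviour, but the timer is cleared once the fetch
// settles, so abort() can only fire for requests still in flight.
function fetchWithTimeout(url, ms = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  return fetch(url, { signal: controller.signal })
    .finally(() => clearTimeout(timer));
}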

WPT tests (focus on the errors in previous crawls)

@GJFR (Member, Author) commented on Aug 27, 2021

Would it be possible to merge this in time for the September crawl?

@rviscomi rviscomi merged commit 1a7ce57 into HTTPArchive:master Aug 30, 2021
@jroakes (Contributor) commented on Oct 13, 2021

Hi @GJFR. In a discussion on Slack, @VictorLeP pointed me to this PR. I have updated a PR for a robots.txt extraction to capture some data for next year's Web Almanac.

I have finalized the PR here and included test cases: #236

The current robots.txt code in the well-known extraction script has two issues:

  1. When several user-agent lines are grouped, it attributes the disallow rules that follow to only the last agent (a grouping sketch follows this list):

# Works for this
user-agent: Googlebot
disallow: /login/

# Does not work for this.  Will attribute the '/login/' match to only Bingbot.
user-agent: Googlebot
user-agent: Bingbot
disallow: /login/

  2. It will match, for example, /book-cataloging/ and genuine-authentic-nike-shoes as secure routes.
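For reference, a minimal grouping-aware sketch (hypothetical, separate from both the well-known metric and the PR code) that attributes each disallow to every user-agent in the preceding group, per the robots.txt grouping rules:

// Hypothetical sketch: consecutive user-agent lines form one group, and
// each disallow that follows is credited to every agent in that group.
function groupDisallows(robotsTxt) {
  const groups = {};        // user-agent -> [disallow values]
  let agents = [];          // agents of the group currently being built
  let lastWasAgent = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim().toLowerCase();
    if (line.startsWith('user-agent:')) {
      const agent = line.slice('user-agent:'.length).trim();
      agents = lastWasAgent ? agents.concat(agent) : [agent];
      lastWasAgent = true;
    } else if (line.startsWith('disallow:')) {
      const value = line.slice('disallow:'.length).trim();
      for (const a of agents) (groups[a] = groups[a] || []).push(value);
      lastWasAgent = false;
    }
  }
  return groups;
}

// groupDisallows('user-agent: Googlebot\nuser-agent: Bingbot\ndisallow: /login/')
// -> { googlebot: ['/login/'], bingbot: ['/login/'] }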

I could update my robots.txt script to produce the following output. I have already addressed #1 in my PR. I think I can make #2 better, but probably not perfect.

    "by_useragent": {
      "*": {
        "allow": 1,
        "disallow": 45,
        "crawl_delay": 0,
        "noindex": 0,
        "well_known_disallows: 5,
        "other": 0
      },
      "baiduspider": {
        "allow": 0,
        "disallow": 163,
        "crawl_delay": 0,
        "noindex": 0,
        "well_known_disallows: 2,
        "other": 0
      },
      ...
      }

I wanted to get a POV from the Security Team on whether we should consolidate, or edit the well-known metric to correct these issues.

@GJFR (Member, Author) commented on Oct 14, 2021

Hi @jroakes. Thanks for the heads-up. As for the issues you mentioned:

  1. This issue should indeed be corrected; I was not aware that the second example is a valid robots.txt configuration. Luckily, in the context of this year's Security chapter, we can still use the data, since we're only interested in which endpoints are covered.

  2. This issue was also discussed in the original PR: Security 2021 custom metrics #219. It was decided to do the filtering at the querying stage, so that the data collected by the custom metric stays consistent over the years, because, as you mentioned, this kind of filtering is difficult to get perfect.

@jroakes (Contributor) commented on Oct 14, 2021

Thanks @GJFR, I think you make a really good point for pt. 2 that I had not considered. FWIW, this is what I had tested, in case it is helpful. It solves pt. 1, and my solution for pt. 2 was to 1) include oauth in the keywords, and 2) only count keyword matches preceded by a non-word character.

This also includes a regex that should match directive lines with varying whitespace, such as:

disallow : useragent
disallow:     useragent
    disallow:  useragent

But it doesn't clean lines first to handle commented values, as in the example below (I did this in robots_txt.js):

disallow: /path/  #This is a secure route
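A simple pre-cleaning step could handle that (an assumption, not part of the POC below):

// Strip trailing comments before matching directives.
const cleaned = line.split('#')[0].trimEnd();

The POC as tested: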
// POC: attribute keyword-matched disallow rules to every user-agent in a group.
let text = $('html').textContent;
let data = {'matched_disallows': {}};
let KEYWORDS = [
  'login',
  'log-in',
  'signin',
  'sign-in',
  'admin',
  'auth',
  'oauth',
  'sso',
  'account',
  'secure'
];

let currUserAgents = [];   // all user-agents in the group currently being parsed
let last_rule = null;
// Matches 'user-agent' or 'disallow' followed by optional whitespace and ':'.
const rule_rx = new RegExp(`(user-agent|disallow)(?=\\s*:)`, 'gi');
const kws = KEYWORDS.join('|');
// A keyword must be preceded by at least one non-word character to count.
const kw_rx = new RegExp(`[^\\w]+(${kws})`, 'gi');

for (let line of text.split('\n')) {

  let matched = line.toLowerCase().match(rule_rx);

  if (matched) {

    let rule = matched[0].trim();
    let value = line.slice(line.indexOf(':') + 1).trim();

    if (rule == 'user-agent') {

      // Consecutive user-agent lines accumulate into one group.
      if (last_rule == rule) {
        currUserAgents.push(value);
      } else {
        currUserAgents = [value];
      }

    } else if (rule == 'disallow') {

      if (value.match(kw_rx)) {

        // Credit the matched disallow to every agent in the current group.
        for (const currAgent of currUserAgents) {

          if (data['matched_disallows'][currAgent] === undefined) {
            data['matched_disallows'][currAgent] = [];
          }

          data['matched_disallows'][currAgent].push(value);
        }

      }
    }

    last_rule = rule;
  }

}

JSON.stringify(data);

Here is test output for LinkedIn:
https://gist.github.com/jroakes/bb5ac67ee4584f58f5a7714d62605da1

I am not that great with JS, so I'm sharing this as a POC example only; it can be improved.
